Common formats

Sequence formats

`fasta`: plain text records that contain header and sequence (nucleic acid or protein)



In [1]:

    
cat ../data/rose.fa









    



>ROSE1
GCCGCCGAGAGGUGGCGUCCCCGACGCCUCAUGGGUCGGGAACGACUGAGACGGGCACCG
GUCGUGUCCGGUGCCGCUCGUAUCCAUUUUGCUCCUUGGAGGAUUUGGCUAUGCGCA



In [2]:

    
head ../data/contigs.fasta









    



>NODE_1667_length_78_cov_19440_ID_3333
AGTGACGCTAATGATGCCTACAATGCCCCGGAGACTGGGCTGTGTAGGTGCGTTCGCCTC
CAGCTATCATCGTCCGGG
>NODE_1664_length_78_cov_14158.5_ID_3327
GCTGCAGATGGCCGCCGCCTTCGAGCACTTCAACGAGGTGCATCCGCACTCGGCGCTGAA
GATGAAATCACCCGGAAA
>NODE_1663_length_79_cov_10190.5_ID_3325
CGAGGTGCATCCGCACTCGGCGCTGAAGATGAAATCACCCGGAGAGTTAAGGCAGCATCG
TGCCTCCCCACAGGGGAGA
>NODE_1661_length_80_cov_11498_ID_3321

`fastq`: plain text records that contain HTS sequence reads and associated quality scores



In [3]:

    
zcat ../data/BJ-HSR1_R1.fastq.gz | head









    



@NS500159:12:H2FJ5AFXX:1:11101:12552:1058 1:N:0:1
NCACTNCATCGAGTGCAAAAAGATGGCGGATGCACTGCCAACCAGCATGGCAATTGCAATGCAGATAGCACGCCAGCGACCTAGATAGGTCAGCGCAGCAAGCTGGTCATGGAGGTGGAAACGTGGCTTCATCTGTACAACGAGTTAGAT
+
#<<<.#FFFFFAFFFF.AA7FAF7AFFAFFF<FA<<A.F.A<FF.FA<FAF..F<AFF<AAF...<<.<F77A<<FFA<.AF.AFF.<.AF.7<7<<7<F<F7FF<A..F<F.<FA...<<<.7.F<FF7FA<A...<7.<<.7FF77.<
@NS500159:12:H2FJ5AFXX:1:11101:6058:1058 1:N:0:1
NCTAANGAATGATTGTAGTCCTCACACCACTCAAACGTCAGCTTATGGCGGCGGATTCAGCCGGTCGATGCAACACACTATGACACGGCCCAGGCCGAAGGAGTGCTGCAGATGGCCGCCGCCATCGAGAACATCCACGCGGAGCAACCC
+
#..7A#F<F7FFFF.AAF)FFFF.F)FF.7F.)<FFFFF)77F)..FFFF.FFFFFF<..F<F.)A<.FF..FFA.<F.AF7.A.7)<FAA.F7FA.AAF<..F7F<.FF..F7.FAFFAF.A.A7..AA...7<<))<F7F77A.)77)
@NS500159:12:H2FJ5AFXX:1:11101:24977:1060 1:N:0:1
NTGCANCACACTATGACACGGCCCAGGCCGAAGGAGTGCTGCAGATGGCCGCCGCCTTCGAGCACTTCAACGAGGTGCATCCGCACTCGGCGCTGAAGCTGAAATCACCCGGAGAGTTAAGGCAGCATCGTGCCTCCCCACAGGGGCGTTA

gzip: stdout: Broken pipe

Quality scores

$$Q_{phred}=-10log_{10}(e)$$

where:

$Q_{phred}$ - quality score
e - probability of a base being called wrong

How to encode it to text?

$Q_{phred} + 33$

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL................................................. 
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126
 0........................26...31.......40

@NS500159:12:H2FJ5AFXX:1:11101:12552:1058 1:N:0:1:

@NS500159 - machine id
12 - run number
H2FJ5AFXX - flowcell id
1 - lane
11101 - tile number
12552:1058 - x and y coordinates
1 - read 1 or 2 (for paire ends)
N - filtered (Y) or not (N)
0 - always 0 for HiSeq and NextSeq
1 - sample no from the sample sheet

`genebank`: similar to `fasta` but allows annotations (metadata).

Old and busted. No standard, very buggy. Very bad format!

Alignment formats

`sam` plain text records that contain read and alignment info

Header lines start with @ and contain metadata: reference sequences names, lengths, aligner, etc.

Each alignment record contains 11 mandatory fields:

QNAME - query template name (think header from fastq file)
FLAG - bitwise flag (more on it in a moment)
RNAME - reference sequence name (e.g. chr1)
POS - 1-based left-most mapping position
MAPQ - mapping quality (think uniqueness of the mapping)
CIGAR - details of the mapping (match/mismatch/indel/clipping etc)
RNEXT - reference sequence name for the pair (mate)
PNEXT - mapping position for the pair (mate)
TLEN - template (query) length
SEQ - (aligned) segment sequence (not necessarily entire query sequence)
QUAL - quality, as in fastq

`FLAG` field

This is possibly the most important field in practical terms.

1 0x1 template having multiple segments in sequencing
2 0x2 each segment properly aligned according to the aligner
4 0x4 segment unmapped
8 0x8 next segment in the template unmapped
16 0x10 SEQ being reverse complemented
32 0x20 SEQ of the next segment in the template being reverse complemented
64 0x40 the first segment in the template
128 0x80 the last segment in the template
256 0x100 secondary alignment
512 0x200 not passing filters, such as platform/vendor quality controls
1024 0x400 PCR or optical duplicate
2048 0x800 supplementary alignment

`bam` binary (compressed) form of `sam`

Same as sam but compresses and therefore is not directly readable. But because of the compression efficiency, it is the preferred way of storing alignment data.

You don't usually work with these directly, rather they are produced as intermediate results that get processed further to yield biologically relevant insights.

These are result of any alignment to reference you perform.

`pileup` tab delimited; records contain aggregate alignment data per reference position

reference name
reference position
reference base
read depth at this position
read bases (total=read depth):
- . match on the forward strand
- , match on the reverse strand
- ACTGN mismatch on forward strand
- actgn mismatch on reverse strand
- +|-[0-9]ACTGNactgn insertion | deletion
- ^ start of the read segment
- $ end of the read segment
read base qualities

Position (genomic interval) formats

gff (former gtf) genomic feature format; tab-delimited plain text
bed generic position format
vcf variant call format

Common tools

samtools

bedtools

bowtie/bowtie2

tuxedo suite

NCBI blast

Assemblers: velvet, trinity, SPades

Genome browsers

New hotness: kallisto/sleuth

Examples of pipelines/workflows

RNA-seq experiment

SHAPE-seq

Net-Seq



In [ ]: